During the past decade evaluators of programmed instruction and
computer-aided instruction have recognized that it is very difficult,
if not impossible, to determine subjectively the effectiveness of test
items and instruction (see Rothkopf, 1963). In this paper we will
specify a set of objective rules, based upon item performance data, for
identifying those test items and sections of instruction that seem to
require revision. This objective method should provide a more rational
basis for decision-making than the subjective method of making decisions
based upon some unidentified combination of subject matter knowledge,
experience, and intuition.

The rationale for decision-making that we propose is basically
an elaboration of a technique devised by Stolurow and Frase (1968).
Their method is based upon a comparison of three different types of
error rates for program frames: (a) the theoretical error rate (T),
which is the error rate expected simply on the basis of random guessing;
(b) the base error rate (B), which is the error rate obtained by students
not exposed to the teaching material for the frame; and (c) the
instructional error rate (I), which is the error rate obtained by
students who have been exposed to the instruction.

In this paper we will treat not only program frames that are an 
integral part of instruction but also test items that occur both before 
and after instruction. In addition, we will use both error rates and 
discrimination indices as data for decision-making.

In order to put the decision process we propose into a conceptual
context, let us assume that we have an instructional program teaching
a set of terminal objectives. Chronologically, each terminal objective
is tested by (a) a pretest item that occurs before the objective has
been taught, (b) a terminal test item that occurs almost immediately
after the objective has been taught, and (c) a posttest item that occurs
"some time after"² the objective has been taught. Without loss of
generality, we will assume (as is usually the case) that the sets of
pretest and posttest items form two tests that occur, respectively, prior
to and following the instruction for all objectives. Furthermore, we will
assume that all of the items testing any objective are either identical
or "corresponding". (The concept of "corresponding" items will be
treated in detail later; however, we can roughly define corresponding items as
items that test the same content at the same level of difficulty.) In
the final analysis, using item performance data, we want to identify
those test items and sections of instruction (relevant to a given objective)
that require revision. The decision process we propose will not
necessarily tell the evaluator how to revise items and/or instruction, but
the process will provide objective rules for deciding what to revise.



Types of Data and Decisions 



Error rate is defined as the proportion of students getting
an item incorrect, i.e.,

$$\text{Error Rate} = \frac{\text{Number of Incorrect Answers}}{\text{Total Number of Answers}} \tag{1}$$

$$ER_i = \frac{N - \sum_{j=1}^{N} x_{ij}}{N} \tag{2}$$

where $ER_i$ means error rate for item i, N is the total number of students
answering the item, $x_{ij} = 1$ if student j gets item i correct, and
$x_{ij} = 0$ if student j gets item i wrong. We can also express Equation 2 as:

$$ER_i = 1 - \frac{\sum_{j=1}^{N} x_{ij}}{N} \tag{3}$$

Since the last term on the right of Equation 3 is item difficulty level
($DL_i$), it is clear that

$$ER_i = 1 - DL_i; \tag{4}$$

i.e., error rate equals one minus difficulty level. Clearly, Equation 4
shows that from a theoretical viewpoint it is immaterial whether we use
difficulty level or error rate; however, using error rate seems to
facilitate an understanding of some of the decisions that will be proposed
later.
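
As an illustration of Equations 2 through 4, the following minimal Python sketch computes the error rate and difficulty level of a single item from a list of scored responses. The function names and the response vector are hypothetical and are introduced here only for illustration; each response is assumed to be coded 1 (correct) or 0 (incorrect).

```python
from typing import Sequence


def error_rate(correct: Sequence[int]) -> float:
    """Proportion of students answering the item incorrectly (Equation 2)."""
    n = len(correct)
    if n == 0:
        raise ValueError("at least one response is required")
    return (n - sum(correct)) / n


def difficulty_level(correct: Sequence[int]) -> float:
    """Proportion of students answering the item correctly."""
    n = len(correct)
    if n == 0:
        raise ValueError("at least one response is required")
    return sum(correct) / n


# Hypothetical responses of ten students to one item (1 = correct).
responses = [1, 0, 1, 1, 0, 1, 1, 1, 0, 1]
er = error_rate(responses)
dl = difficulty_level(responses)
print(er, dl)
```

With seven of ten responses correct, the error rate and the difficulty level sum to one, as Equation 4 requires.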

In much of what follows we will assume that error rates are
classified as either high (H) or low (L), and that the evaluator
predetermines an appropriate cut-off point between high and low error rate.
For any given objective, the cut-offs for TER, BER, IER, and PER must be
identical in order to apply the rules that will be specified. Also, in
most cases, the cut-offs chosen will probably be the same for all
objectives; however, occasions can arise when certain objectives should have
a higher (or lower) error rate cut-off than other objectives. For example,
items testing very crucial objectives might be assigned a cut-off of
0.90, while other items might have a cut-off of 0.70.

Discrimination indices will be classified as either positive
(+), negative (-), or non-discriminating (0). By positive and negative
indices we mean indices that discriminate significantly (at some
appropriate α level) in the positive and negative directions, respectively.
The discrimination index used should, of course, be appropriate for the
data in question.

Before instruction we can obtain three types of data for each
objective that has a pretest item:

(a) the Theoretical Error Rate (TER), which is the expected
proportion of students getting a pretest item incorrect simply on the
basis of random guessing; i.e., if K is the number of possible answers
to an item, then

$$TER = \frac{K - 1}{K}. \tag{5}$$

For example, if an item has five alternatives, we would expect 80
percent of the students to get the item incorrect simply by guessing,
without any knowledge of the objective tested by the item³;

(b) the Base Error Rate (BER), which is the observed
proportion of students getting a pretest item incorrect; and

(c) the Base Discrimination Index (BDI), which is the
discrimination index for a pretest item. (We will use total score on the
pretest as the criterion variable for BDI.)
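
The theoretical error rate of Equation 5 and the high/low classification described earlier can be sketched in a few lines of Python. The function names are hypothetical, and the treatment of an error rate exactly equal to the cut-off (classified here as low) is an assumption, since the text does not say how ties are handled.

```python
def theoretical_error_rate(k: int) -> float:
    """Expected error rate under random guessing among k alternatives (Equation 5)."""
    if k < 1:
        raise ValueError("an item needs at least one possible answer")
    return (k - 1) / k


def classify_error_rate(rate: float, cutoff: float) -> str:
    """Return 'H' if the rate exceeds the evaluator's cut-off, else 'L'.

    Assumption: a rate exactly equal to the cut-off is classified as low.
    """
    return "H" if rate > cutoff else "L"


ter = theoretical_error_rate(5)        # five-alternative item
print(ter, classify_error_rate(ter, cutoff=0.70))
```

For the five-alternative item used as an example in the text, the theoretical error rate is 0.80, which is classified as high under the illustrative cut-off of 0.70.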

After instruction we can obtain two types of data for each
objective that has a posttest item: (a) the Posttest Error Rate (PER),
and (b) the Posttest Discrimination Index (PDI). (Total score on the
posttest will be used as the criterion variable for PDI.)

Immediately following the instruction for any objective we can
obtain the Instructional Error Rate (IER), which is the error rate on a
terminal test item for a given objective⁴. Note that IER refers to the
error rate on a terminal item, not the error rate on other questions
associated with teaching the given objective. We will not consider
Instructional Discrimination Index since, in our opinion, it does not
seem to be very useful for making decisions beyond those that can be
made with the other types of data.

In subsequent sections we will analyze the decisions that can
be made on the basis of: (a) pretest data alone; (b) posttest data
alone; (c) pretest and posttest data; and (d) pretest, posttest, and
instructional test data. In this way, the contribution of the various
types of data to the decision process should be evident. For each
analysis, we will specify reasons for determining whether test items or
instruction relevant to a given objective should be revised (R),
questioned (?), or not revised (NR)⁵. Since we are assuming that all items
testing a given objective are identical or "corresponding", a decision
about item revision applies equally to all items testing the objective
in question. For example, if on the basis of pretest data it is clear
that an item should be revised, we must also revise the corresponding
terminal test item and posttest item. Thus, when we say that an item
should be revised, we mean that all items testing the given objective
should be revised. Likewise, when we say that instruction should be
revised, we mean that that part of the instructional system that attempts
to teach the given objective should be revised.

Pretest Data

Prior to instruction we can collect three sets of data:
Theoretical Error Rate (TER), Base Error Rate (BER), and Base
Discrimination Index (BDI). Given these three sets of data, various
reasonable rules can be formulated for making decisions about whether
or not to revise test items. It is not likely that only pretest data
would be used to make decisions about items, yet it is useful to
consider the types of decisions that are appropriate on the basis of
such data.

Rule 1. If TER and BER are both the same (i.e., H, H or
L, L), then no necessity for revision is indicated. In this case,
the observed error rate (BER) without benefit of instruction is
approximately the same as the expected error rate (TER).

Rule 2. If TER is low (L) and BER is high (H), then no
revision is indicated. This anomalous case could arise if the particular
objective for the item involved concepts that are typically
misunderstood. For example, many students (in the authors' opinion)
believe that "flammable" and "inflammable" have different meanings.
If an item were constructed testing whether or not "flammable" and
"inflammable" have the same meaning, and if this item were given prior
to instruction, it is quite possible that more students would get the
item incorrect than we would expect on the basis of the theoretical
error rate (TER). In this case, there is no reason to revise the item;
rather, we expect that the instruction will correct the students'
misconception.

Rule 3. If TER is high (H) and BER is low (L), then the
item will probably need to be revised. In this case, students,
without benefit of instruction, are performing considerably better than
expected. It appears that the item itself may be teaching or that the
distractors are so easy that most students can pick the correct answer
by the process of elimination. In either case, the item should be
revised.⁶

Rule 4. If an item is negatively discriminating before
instruction, then the item is questionable in that it may need revision.
If, however, the item is positively discriminating or non-discriminating,
then no revision is indicated. A negatively discriminating item is
questionable since it indicates that the worse students (on the basis
of total test score) are out-performing the better students; however, a
situation similar to that indicated in Rule 2 could be the cause of the
negative discrimination index. A positively discriminating item is
quite possible and reasonable prior to instruction simply because some
good students are usually expected to perform better than chance on a
pretest. A non-discriminating item is the best of all possibilities.

Rule 5. If an item is positively or negatively discriminating
before instruction, then the prerequisites for the objective tested by
the item should be checked. Clearly, whenever an item is discriminating
(either positively or negatively) one group (upper or lower) is
outperforming the other group (lower or upper). In such a case, it seems
reasonable to check whether or not the group with the higher error
rate does, in fact, possess the prerequisites necessary to achieve the
given objective.



Insert Table 1 about here 



These rules, as well as all other rules that will be dis- 
cussed, are given in abbreviated form in Table 1. 
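
The pretest rules can be expressed as a small decision function. The sketch below is only one possible encoding of Rules 1-5: the function name, the return format, and the choice to let the more serious decision prevail when Rules 1-3 and Rule 4 disagree are assumptions made for illustration.

```python
def pretest_decisions(ter: str, ber: str, bdi: str) -> dict:
    """Apply Rules 1-5 to categorical pretest data.

    ter, ber: 'H' or 'L'; bdi: '+', '-' or '0'.
    Returns the item decision ('R', '?' or 'NR') and whether the
    prerequisites for the objective should be checked.
    """
    if ter not in ("H", "L") or ber not in ("H", "L") or bdi not in ("+", "-", "0"):
        raise ValueError("unexpected category")

    if ter == "H" and ber == "L":
        item = "R"       # Rule 3: students do far better than chance
    else:
        item = "NR"      # Rule 1 (H,H or L,L) or Rule 2 (L,H)

    if bdi == "-" and item == "NR":
        item = "?"       # Rule 4: negative discrimination is questionable

    check_prerequisites = bdi in ("+", "-")   # Rule 5
    return {"item": item, "check_prerequisites": check_prerequisites}


print(pretest_decisions("H", "L", "0"))
print(pretest_decisions("H", "H", "-"))
```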

Posttest Data 

As a result of administering a posttest two types of data
can be collected: the Posttest Error Rate (PER) and the Posttest
Discrimination Index (PDI). Since these data are collected after
instruction, theoretically decisions can be made about both items and
instruction; however, it is very difficult to identify items and
instruction that should be revised solely on the basis of posttest data.
In almost every case, we can say whether or not there is something
wrong, but we cannot pinpoint the problem.

Rule 6. If PER = L and PDI = 0, then neither the item nor
the instruction needs to be revised. This is the best possible situation,
since the optimal conditions for both error rate and discrimination
index are fulfilled; i.e., at the end of instruction we hope that most
of the students get the posttest item correct (PER = L), and that the
item is non-discriminating (PDI = 0). (Later we will discuss our
reasons for preferring non-discriminating items.)

Rule 7. If PER = L and PDI = + or -, then both the item and
the instruction are questionable. The fact that PDI is clearly
non-zero indicates a possible need for revision.

Rule 8. If PER = H and PDI = -, then both the item and
instruction should be revised, since PER = H and PDI = - is the worst
possible situation that can occur. It is possible that either the
item or the instruction is at fault, but not both; however, we assume
here that the most universally applicable decision is to check both the
item and the instruction to see what revisions are needed.

Rule 9. If PER = H and PDI = + or 0, then the instruction
should be revised and the item should be questioned. Whenever error
rate is high after instruction, something is wrong, but without
additional information we do not know whether the fault definitely lies
with the item or the instruction. However, the authors believe that
evaluators are often more confident about the test items than they are
about the instruction; it is also possible that the test items have
been previously validated or partially validated. Therefore, in this
case, it seems reasonable to place a less stringent decision on the
item than on the instruction.

Rule 10. When PDI = + or PDI = -, then the prerequisites for
the objective tested by the item should be examined. The reason for this
decision is identical to that presented in Rule 5 in the previous section.

Pretest and Posttest Data

It is evident from Table 1 that neither the pretest data
alone (see Rules 1-5) nor the posttest data alone (see Rules 6-10)
give the evaluator much indication about which items and/or sections
of instruction should be revised. Clearly, more meaningful decisions
can be made by combining the two sets of data. When this is done all
of the rules discussed in the last two sections are applicable, with
the exception of Rule 5, which is superseded by Rule 10. In addition,
one more rule can be specified.

Rule 11. If BDI = - and PDI = -, then the item should be
revised. Both before and after instruction the item is negatively
discriminating, which means that the upper group (based on total test
score) has a proportionately higher error rate than the lower group.
This clearly is an unfortunate circumstance indicating that the item
should be revised.
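
The posttest rules, together with Rule 11, lend themselves to the same kind of encoding. The following sketch is illustrative only; the function name and return format are hypothetical, and Rule 9 is assumed to cover every high-error-rate case not already handled by Rule 8.

```python
from typing import Optional


def posttest_decisions(per: str, pdi: str, bdi: Optional[str] = None) -> dict:
    """Apply Rules 6-11 to categorical posttest data (and BDI, if available).

    per: 'H' or 'L'; pdi and bdi: '+', '-' or '0'.
    """
    if per not in ("H", "L") or pdi not in ("+", "-", "0"):
        raise ValueError("unexpected category")

    if per == "L" and pdi == "0":
        item, instruction = "NR", "NR"   # Rule 6: best possible situation
    elif per == "L":
        item, instruction = "?", "?"     # Rule 7: low error rate, non-zero PDI
    elif pdi == "-":
        item, instruction = "R", "R"     # Rule 8: worst possible situation
    else:
        item, instruction = "?", "R"     # Rule 9 (assumed: remaining PER = H cases)

    if bdi == "-" and pdi == "-":
        item = "R"                       # Rule 11: negative before and after

    return {
        "item": item,
        "instruction": instruction,
        "check_prerequisites": pdi in ("+", "-"),   # Rule 10
    }


print(posttest_decisions("L", "-", bdi="-"))
```

In the usage line, Rule 7 alone would only question the item, but Rule 11 raises the item decision to "revise" because the item discriminates negatively both before and after instruction.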

Pretest, Posttest, and Terminal Item Data 

Recall that Instructional Error Rate (IER) is the error rate
on a terminal item immediately following instruction. If, in addition
to pretest and posttest data, we also take into account IER, it is
possible to make fairly definite statements about whether or not to
revise most segments of instruction that are related to terminal
objectives. The addition of IER does not, however, tell us much more about
the revision of items than we already know from pretest and posttest
data. All of the rules previously specified are applicable except for
Rule 5, which is superseded by Rule 10. Also, we can specify four
additional rules.

Rule 12. If Instructional Error Rate (IER) and Posttest
Error Rate (PER) are low, then no revision (NR) of instruction is
indicated. Both during instruction and after instruction most of the
students seem to achieve the objective (tested by the instructional item
and the posttest item); therefore, we have two indications that the
instruction is adequate, and no revision is indicated.

Rule 13. If IER = L and PER = H, then the instruction should
be revised. During instruction students seem to achieve the objective,
but on the posttest the same students have a higher error rate for the
same objective. Thus the data indicate a retention problem, and the
instruction should be revised to correct this situation. Perhaps more
review is needed.

Rule 14. If IER = H and PER = L, then the instruction should
be questioned. This is probably an unlikely situation that would
seldom occur in practice. However, the fact that students experience a
high error rate on a terminal item during instruction seems to
indicate that something may be wrong with the instruction.⁷

Rule 15. If IER = H and PER = H, then the instruction
definitely should be revised. Both during and after instruction students
do not seem to achieve the objective under consideration. We, therefore,
have two indications of a need for revising the instruction.
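
Because Rules 12-15 exhaust the four combinations of IER and PER, they can be written as a simple lookup table. The sketch below is a hypothetical encoding; the table and function names are not part of the original method.

```python
# (IER, PER) -> (decision about instruction, rule number)
INSTRUCTION_RULES = {
    ("L", "L"): ("NR", 12),  # objective achieved during and after instruction
    ("L", "H"): ("R", 13),   # retention problem
    ("H", "L"): ("?", 14),   # unlikely, but instruction is questionable
    ("H", "H"): ("R", 15),   # objective not achieved at either point
}


def instruction_decision(ier: str, per: str) -> tuple:
    """Return the decision ('R', '?' or 'NR') and the rule that produced it."""
    try:
        return INSTRUCTION_RULES[(ier, per)]
    except KeyError:
        raise ValueError("IER and PER must each be 'H' or 'L'") from None


print(instruction_decision("L", "H"))
```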

Decisions Based Upon Differences
Between Error Rates

Most of the foregoing decision rules are dependent upon the
evaluator's choice of a cut-off between high and low error rate.
Dichotomizing error rate in this way clearly facilitates the identification
of appropriate decision rules, and, in many cases, the simplicity of
the technique will probably outweigh any loss of precision. However,
we can also specify an additional set of four useful decision rules
that take into account quantitative differences between error rates.
Three of these rules increase the power of previous decisions; the
other provides essentially new information. We will call these error
rates "derived" error rates in order to distinguish them from the "raw"
error rates discussed in the previous sections.

Let us consider several limitations of the high/low
classification procedure for error rates. Suppose that Theoretical Error
Rate (TER) and Base Error Rate (BER) for a given objective are both
classified as high (H), while Instructional Error Rate (IER) and
Posttest Error Rate (PER) are both classified as low (L). Clearly, any
actual arithmetic differences between TER and BER, as well as between
IER and PER, will not affect the decisions we have thus far proposed.
Also, since BER and IER are merely classified as high and low,
respectively, we won't have a quantitative measure of how much learning has
actually taken place.

Difference Error Rate

Rules 1-3 are useful for making decisions based upon
categorical differences between BER and TER, but we can make more accurate
decisions by actually computing the differences between these error
rates. Let

$$DER = TER - BER, \tag{6}$$

where DER stands for "Difference Error Rate". If DER = 0, then the
observed error rate (BER) on the pretest item in question is identical
to the expected error rate (TER). If DER < 0, then fewer students
are getting the item correct than we would expect on the basis of
random guessing. Finally, if DER > 0, then more students are getting
the item correct than we would expect. As discussed previously, the
last possibility is often an unfavorable situation, since it can mean
that the item somehow "gives away" the correct answer.

We can test the significance of a positive difference between
BER and TER by computing

$$Z = \frac{DER - \frac{1}{2N}}{\sqrt{TER(1 - TER)/N}}, \tag{7}$$

where N is the total number of students in the sample (see Snedecor &
Cochran, 1967, p. 210)⁸. The computed Z value is then compared with
the normal curve standard score at an appropriate level of significance
for a one-tailed test. (Note that we are interested only in positive
values of DER.) We can now specify a more precise rule to replace
Rules 1-3.

Rule 16. If the value of DER is significantly greater than
zero, then the item should be revised. In all other cases no revision
is required.
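
The one-tailed test behind Rule 16 can be sketched as follows. The sketch assumes that Equation 7 is the usual normal approximation to the binomial with a continuity correction of 1/(2N), and that the item is flagged when DER = TER - BER is significantly positive; the function names and the default significance level of 0.05 are hypothetical choices.

```python
import math
from statistics import NormalDist


def der_z(ter: float, ber: float, n: int) -> float:
    """Z statistic for DER = TER - BER (assumed form of Equation 7)."""
    if not 0.0 < ter < 1.0:
        raise ValueError("TER must lie strictly between 0 and 1")
    if n <= 0:
        raise ValueError("sample size must be positive")
    der = ter - ber
    standard_error = math.sqrt(ter * (1.0 - ter) / n)
    return (der - 1.0 / (2 * n)) / standard_error   # continuity correction


def rule_16(ter: float, ber: float, n: int, alpha: float = 0.05) -> str:
    """Return 'R' if DER is significantly greater than zero, else 'NR'."""
    critical = NormalDist().inv_cdf(1.0 - alpha)    # one-tailed critical value
    return "R" if der_z(ter, ber, n) > critical else "NR"


# Hypothetical five-alternative item answered by 28 students.
print(rule_16(ter=0.80, ber=0.50, n=28))
print(rule_16(ter=0.80, ber=0.75, n=28))
```

In the first call the observed error rate is far below chance, so the item would be flagged; in the second the difference is too small to be distinguished from guessing.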

Retention Error Rate 

Rules 12-15 are useful for making decisions based upon
categorical differences between IER and PER, but we can supplement
these decisions by calculating the actual difference between IER and
PER and comparing this value to some preassigned cut-off. Let

$$RER = PER - IER, \tag{8}$$

where RER stands for "Retention Error Rate". If RER = 0, then the
number of errors on the posttest item and the related terminal item is
identical, and no retention problem is evident. If RER > 0, then
students make more errors on the posttest item than on the terminal item.
The latter situation can be serious if RER is considerably greater than
zero; however, it is not clear how to define "considerably greater than
zero".

We can, of course, test the statistical significance of RER
if certain distributional assumptions can be made, but such a test would
not, in our opinion, provide a meaningful basis for decision. What is
needed is a cut-off above which the amount of forgetting is great enough
to justify revision of instruction. Such a cut-off must take into
account the criticality of forgetting, which in turn is dependent upon
many factors including the content matter of the instructional system
and the population for which the system is being developed. Furthermore,
there is no theoretical rationale for specifying the same cut-off for
all items. Thus, in our opinion only the evaluator can make an
appropriate choice of a useful cut-off. It, therefore, seems reasonable to
specify the following rule as a more powerful version of Rule 13.

Rule 17. If RER > $c_1$, where $c_1$ is a cut-off specified by
the evaluator, then the instruction should be revised, since the data
indicate a retention problem. If $0 \le RER \le c_1$, then no revision is
required. The cut-off, $c_1$, need not be the same for all objectives.

The one possibility that we did not consider above is RER < 0;
i.e., students make fewer errors on the posttest item than on the
terminal item. We stated previously, in the discussion of Rule 14, that this
is an unlikely occurrence; however, the evaluator may want to specify a
cut-off below which he considers this problem to be serious enough to
merit a closer examination of the instruction.

Rule 18. If RER < $-c_2$, where $c_2$ is a cut-off specified by the
evaluator, then the instruction should be questioned. If $-c_2 \le RER \le 0$,
then no revision is required. As before, the cut-off need not be the
same for all objectives.
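
Rules 17 and 18 can be combined into a single check on RER = PER - IER. In the sketch below the function name and the cut-off values are hypothetical; in practice both cut-offs are chosen by the evaluator and may differ from objective to objective.

```python
def retention_decision(ier: float, per: float, c1: float, c2: float) -> str:
    """Decision about instruction based on RER = PER - IER.

    c1: cut-off above which forgetting justifies revision (Rule 17).
    c2: cut-off below which the unexpected improvement is questioned (Rule 18).
    """
    if c1 < 0 or c2 < 0:
        raise ValueError("cut-offs must be non-negative")
    rer = per - ier
    if rer > c1:
        return "R"    # Rule 17: retention problem
    if rer < -c2:
        return "?"    # Rule 18: posttest much better than terminal item
    return "NR"


# Hypothetical cut-offs of 0.20 in both directions.
print(retention_decision(ier=0.10, per=0.40, c1=0.20, c2=0.20))
print(retention_decision(ier=0.50, per=0.20, c1=0.20, c2=0.20))
```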

Percentage of Maximum Possible Gain 

None of the decisions discussed up to this point has made use of
any measure of gain in knowledge relevant to a given objective that
results from the instructional system. It is probably true that gain is
not as important as final performance on the posttest in most
instructional systems; however, if students experience relatively little gain
as a result of experiencing instruction, one can legitimately question
the value of the instructional system itself. Thus, measures of gain
have long been a subject of considerable interest in the fields of
programmed instruction, computer-aided instruction, and multimedia
instruction (see Lumsdaine, 1965).

The simplest measure of gain for an objective is the
difference between error rate on a pretest item (BER) and error rate on the
corresponding terminal item (IER)⁹. Such a measure would, however,
mean that a gain of 0.50 resulting from BER = 1.00 and IER = 0.50 would
be indistinguishable from a gain of the same magnitude resulting from
BER = 0.50 and IER = 0.00. In the former case, the instructional system
has failed to produce 50 percent of the gain in performance that could
be achieved, while in the latter case, the instructional system has
produced as much gain as possible given the entry level of the students.
Thus, in the former case, some revision of the instruction may be
desirable, while in the latter case, no revision in the instructional
system is required on the basis of this particular data.

The above rather trivial example illustrates that simple gain
does not provide a very meaningful basis for revising instruction. A
better measure is percent of maximum possible gain for an objective,
defined as:

$$PMPG = \frac{BER - IER}{BER}. \tag{9}$$

In order to make use of this measure the evaluator must specify a
cut-off that determines whether or not a given value for PMPG indicates a
need for revision; i.e.,

Rule 19. If PMPG < $c_3$, where $c_3$ is a cut-off specified by
the evaluator, then the instruction should be revised. The cut-off $c_3$
need not be the same for all objectives.
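
The two cases used in the text to motivate PMPG can be checked numerically. The sketch below computes Equation 9 and applies Rule 19; the function names and the cut-off of 0.75 are hypothetical, and a base error rate of zero is treated as an error because no gain is then possible.

```python
def pmpg(ber: float, ier: float) -> float:
    """Percentage of maximum possible gain, expressed as a proportion (Equation 9)."""
    if ber <= 0:
        raise ValueError("PMPG is undefined when the base error rate is zero")
    return (ber - ier) / ber


def rule_19(ber: float, ier: float, c3: float) -> str:
    """Return 'R' if PMPG falls below the evaluator's cut-off c3, else 'NR'."""
    return "R" if pmpg(ber, ier) < c3 else "NR"


# The two cases from the text: both show a simple gain of 0.50.
print(pmpg(1.00, 0.50), rule_19(1.00, 0.50, c3=0.75))
print(pmpg(0.50, 0.00), rule_19(0.50, 0.00, c3=0.75))
```

Although the simple gain is identical, the first case realizes only half of the possible gain while the second realizes all of it, which is exactly the distinction PMPG is meant to capture.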

The literature contains many in-depth discussions and debates
about the problems and pitfalls associated with measures of gain (see,
for example, Cronbach and Furby, 1970). Most of this literature,
however, treats measures of gain in the context of their use in
inferential statistics or correlational analysis. While we appreciate the
importance of these issues, we hasten to add that measures of gain, 
merely as descriptive statistics, can provide useful information to 
evaluators. We believe that the use of PMPG, as data for evaluation 
purposes, is a case in point. 

When data of the type discussed in this section are used
along with the basic pretest, posttest, and terminal item data, then
the appropriate decision rules are 6-11 and 16-19. If only pretest
and posttest data are available, then Rule 16 can be used to replace
Rules 1-3.



An Example 

The data reported in Table 2 are based upon the responses of
28 students to a subset of test questions in an interactive CAI program
in micro-economics developed at the Harvard Computer-Aided Instruction
Laboratory¹⁰.



Insert Table 2 about here



The discrimination index used for both BDI and PDI is the
phi-coefficient. In the case of BDI, all students with scores of four
or more items correct on the pretest were classified into the upper
criterion group, and all other students were classified into the lower
criterion group. In the case of PDI, all students with scores of 15
or more items correct on the posttest were classified into the upper
group, and all students with scores of 12 or fewer items correct were
classified into the lower group. Both BDI and PDI were tested using a
correction for discontinuity (see Edwards, 1967, p. 333) and two-tailed
probability levels.
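
For readers who wish to reproduce this kind of discrimination index, the following sketch computes a phi-coefficient from a 2 x 2 table of criterion group by item outcome and classifies it as +, - or 0. It assumes that the correction for discontinuity is the familiar Yates-type correction applied to chi-square with one degree of freedom, which may differ in detail from the procedure the authors followed; the function names and cell counts are hypothetical.

```python
import math


def phi_coefficient(a: int, b: int, c: int, d: int) -> float:
    """Phi for the table: a = upper correct, b = upper incorrect,
    c = lower correct, d = lower incorrect."""
    denominator = math.sqrt((a + b) * (c + d) * (a + c) * (b + d))
    if denominator == 0:
        return 0.0   # convention used here when a margin is empty
    return (a * d - b * c) / denominator


def corrected_chi_square(a: int, b: int, c: int, d: int) -> float:
    """Chi-square (1 df) with a Yates-type continuity correction (assumed)."""
    n = a + b + c + d
    denominator = (a + b) * (c + d) * (a + c) * (b + d)
    if denominator == 0:
        return 0.0
    return n * max(abs(a * d - b * c) - n / 2, 0) ** 2 / denominator


def classify_discrimination(a: int, b: int, c: int, d: int,
                            critical: float = 3.841) -> str:
    """Return '+', '-' or '0'; 3.841 is the .05 two-tailed critical value
    of chi-square with one degree of freedom."""
    if corrected_chi_square(a, b, c, d) <= critical:
        return "0"
    return "+" if phi_coefficient(a, b, c, d) > 0 else "-"


# Hypothetical counts for 28 students split into two groups of 14.
print(phi_coefficient(10, 4, 4, 10), classify_discrimination(10, 4, 4, 10))
print(classify_discrimination(13, 1, 2, 12))
```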



Insert Table 3 about here 



The categorical error rates and discrimination indices given
in Table 3 are based upon the cut-off values indicated in the footnotes
to that table. The cut-offs used were selected primarily for
illustrative purposes, and are not necessarily intended to be optimal cut-offs
from a theoretical standpoint. Note that the cut-offs are the same for
all items.



Insert Table 4 about here 



Table 4 lists the decisions that result from applying the
various decision rules to three different subsets of the data reported
in Table 3. When two rules indicate a need for revision, both are
given; in most other cases, only one rule is applicable. Occasions do
arise, however, when two or more different decisions are applicable to
the same item or segment of instruction. For example, objective number
five has IER = H, PER = L and PDI = 0. According to Rule 6 the
instruction does not need revision, but Rule 14 indicates that the instruction
is questionable. We have chosen to resolve such conflicts by selecting
the decision that has the most serious implications for revision; i.e.,
"questionable" (?) has more serious implications for revision than "do
not revise" (NR), and "revise" (R) has more serious implications for
revision than either "questionable" (?) or "do not revise" (NR). Thus,
for objective number five we have labelled the instruction "questionable"
in the second set of decisions.
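
This conflict-resolution convention amounts to taking the maximum over an ordering of the three decisions. A minimal sketch, with hypothetical names, is:

```python
# Ordering of decisions by seriousness of their implications for revision.
SEVERITY = {"NR": 0, "?": 1, "R": 2}


def resolve(decisions: list) -> str:
    """Return the most serious of several applicable decisions."""
    if not decisions:
        raise ValueError("at least one decision is required")
    return max(decisions, key=lambda decision: SEVERITY[decision])


# Objective five, second set of decisions: one rule gives NR, another gives ?.
print(resolve(["NR", "?"]))
```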

In Table 4 the first set of decisions uses more data than the
second which, in turn, uses more data than the third. One possible
effect of decreasing the amount of data used is illustrated by the
decisions with regard to instruction for objective number five. Using
all of the data for objective five in Table 4, Rule 19 indicates that
the instruction should be revised. When, however, derived error rates
are eliminated, Rule 19 becomes inapplicable, and Rule 14 indicates that
the instruction should be examined, but not necessarily revised. Finally,
when both derived error rates and IER are eliminated, both Rules 19 and
14 become inapplicable, and Rule 6 indicates that no revision is required.
This situation is an empirical demonstration of the desirability of
obtaining as much data as possible in order to strengthen decisions about
the adequacy of instruction.

This statement does not, however, imply that an increase in
the amount of available data will necessarily increase the number of
decisions involving the revision (R) of items or instruction. Consider,
for example, the decisions involving instruction, given in Table 4, for
objective number 11. Using only pretest and posttest data, no revision
is required according to Rule 6. When IER is included as data for
decision making, the second set of decisions indicates that the
instruction is questionable according to Rule 14. When, however, all available
data are used (i.e., pretest and posttest data, IER, and the derived
error rates), we again arrive at the decision "no revision" according
to Rules 6 and 18¹¹. Clearly, in the case of objective 11, an increase
in the amount of available data ultimately confirms our initial judgment
that no revision of instruction is required.

For this particular instructional system, Table 4 indicates that
the availability of derived error rates increases the number of
decisions that involve revision of items and instruction. Furthermore, in
general, revision is most often necessitated by relatively poor
performance on the posttest (note the many times Rules 8 and 9 are employed)
and relatively poor retention (note the many times Rules 13 and 17 are
employed). Also, the instruction seems to be in more need of revision
than the test items. These general observations do, in fact, coincide
with the predictions of the person responsible for developing this
particular instructional program¹².

Discussion

It is certainly reasonable to expect that some readers may
feel that certain decisions we have proposed are not appropriate for
their particular programs, or that other decision rules should be
added. We have tried to specify those decisions that we feel are the
most universally applicable; however, even more important than the
actual decision rules presented is the method used to arrive at
decisions about test items and instruction. Hopefully this method is
generalizable.

In this section we will discuss various factors that have 
applicability to the rules we have presented and the decision process 
we have proposed. 

Instructional Systems and Criterion-Referenced Testing 

One might define an instructional system in general as a
replicable method of instruction providing feedback that can be used
for revision purposes. Such systems are usually characterized by a
close correspondence between test items and behavioral objectives, i.e.,
test items are criterion-referenced. In addition, it is usually
expected that "most" of the students will get "most" of the terminal and
posttest items correct.

Brennan (1970) and Popham & Husek (1969) have examined some 
aspects of the applicability of classical test theory to the analysis 
of criterion-referenced tests. Perhaps the most important implication 
of these analyses is that the classical normality assumptions 
concerning errors of measurement do not seem to be appropriate in the 
criterion-referenced testing situation; the errors of measurement seem 
to be better characterized by binomial error models (see Lord & Novick, 
1968, Chapter 23). This means that many of the statistics used in 
classical test theory are not applicable in the criterion-referenced 
testing situation. For example, the biserial discrimination index is 
not appropriate for criterion-referenced test data, since total scores 
on the test are not necessarily normally distributed; a similar comment 
can be made about the tetrachoric discrimination index. 

Another characteristic of a good instructional system is 
that all students who receive instruction achieve criterion 
performance on the posttest regardless of previous knowledge or 
experience (see Stolurow & Davis, 1965). Ideally, in fact, we may want 
all students to achieve all objectives. In such a situation all items 
would be non-discriminating (assuming, of course, that total test score 
is the criterion used for judging discriminability). This line of 
reasoning indicates why we have specified that non-discriminating items 
do not indicate a need for revision. Conversely, items that are 
significantly discriminating (especially negatively discriminating 
items) indicate a possible need for revision, since the instructional 
system is performing worse for one group of students than for another 
group. 

Corresponding Items 

When discussing the context of the rationale that has been 
presented, we assumed that for each objective there exists a pretest, 
posttest, and terminal test item; furthermore, we assumed that the 
items testing a given objective are, in some sense, "corresponding," 
"equivalent," or "parallel." 

The terms "equivalent" and "parallel" are, in the classical 
sense, usually applied to tests. A set of k tests is said to be 
"parallel" or "equivalent" if the tests have equal means, equal 
variances, and equal intercorrelations (see Gulliksen, 1950, p. 173). 
This does not mean, however, that there is necessarily any strict 
correspondence among items in the k tests. Thus, in the rationale that 
we have proposed, and in criterion-referenced testing in general, the 
classical concept of parallel tests is clearly not sufficient, since we 
are very concerned about the performance of students on individual 
items, not just entire tests. Let us, therefore, reserve the terms 
"parallel" and "equivalent" for entire tests, and examine the analogous 
issue of "corresponding" items. 

We can define "corresponding" items, in general, as items 
that measure the same thing. Clearly, then, one requirement of 
corresponding items is that, in the judgment of specialists, the items 
measure the same behavioral objective. Furthermore, just as we have a 
statistical criterion for parallel tests, it seems reasonable to have a 
similar statistical criterion for corresponding items. Thus, another 
reasonable requirement for corresponding items would seem to be that 
they have equal means, equal variances, and equal intercorrelations. 
Since we are assuming that items are scored dichotomously, the mean of 
item i is simply the proportion of correct responses (p_i) and the 
correlation between any two items i and j is the phi correlation (r_ij). 

Now, suppose we give a set of k tests to N students in order 
to determine whether or not the tests are parallel; i.e., whether or 
not the set of k means, k variances, and k(k - 1)/2 intercorrelations 
are equal except for sampling differences. Wilks (1946) provides a 
statistical test to answer this question. 

Unfortunately, however, Wilks' test is not applicable for 
judging the equality of a set of means, variances, and intercorrelations 
for k dichotomously scored criterion-referenced items. Wilks' test 
assumes a normal multivariate population distribution, and, as we have 
stated previously, the assumption of normality is probably inappropriate 
in the criterion-referenced testing situation. 

As far as we know, there is no currently available method for 
simultaneously testing the equality of means, variances, and 
correlations among dichotomously scored items that are not necessarily 
normally distributed. We can, however, approach a solution to the 
problem by applying what is usually called Cochran's Q Test (see Siegel, 
1956, pp. 161-166), which is a test for the equality of means, or 
proportions (p_j), among dichotomized variables (in this case, test 
items). 
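As an illustration of the Q Test applied to item data, the following is a minimal sketch only. It assumes the responses are held as one row per student and one 0/1 score per item; the function name `cochran_q` and the example data are hypothetical and are not taken from the program analyzed in this paper. The statistic is computed in its usual form from the item (column) totals and the student (row) totals.

```python
from typing import Sequence, Tuple


def cochran_q(scores: Sequence[Sequence[int]]) -> Tuple[float, int]:
    """Return Cochran's Q and its degrees of freedom (k - 1).

    `scores` holds one row per student and one 0/1 entry per item.
    """
    n_items = len(scores[0])
    if n_items < 2 or any(len(row) != n_items for row in scores):
        raise ValueError("need at least two items and rows of equal length")

    # Number of correct responses per item and per student.
    column_totals = [sum(row[j] for row in scores) for j in range(n_items)]
    row_totals = [sum(row) for row in scores]
    grand_total = sum(row_totals)

    numerator = (n_items - 1) * (
        n_items * sum(c * c for c in column_totals) - grand_total ** 2
    )
    denominator = n_items * grand_total - sum(r * r for r in row_totals)
    if denominator == 0:
        # Every student answered all items correctly or all incorrectly.
        raise ValueError("Q is undefined when no student's responses vary")
    return numerator / denominator, n_items - 1


if __name__ == "__main__":
    # Hypothetical data: six students, three items testing one objective.
    example = [
        [1, 1, 1],
        [1, 1, 0],
        [1, 0, 0],
        [0, 1, 0],
        [1, 1, 1],
        [1, 0, 0],
    ]
    q, df = cochran_q(example)
    print(f"Q = {q:.3f} with {df} degrees of freedom")
```

Q is referred to the chi-square distribution with k - 1 degrees of freedom. For these illustrative data Q = 3.5 with 2 degrees of freedom, which is below the .05 critical value of 5.99, so the hypothesis of equal proportions would not be rejected.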

Since the variance of a dichotomous variable scored zero or 
one is completely determined by the mean (or proportion of successes), 
it is clear that if the means of k items are equal, then the variances 
will also be equal. However, even if the means and variances of k items 
are equal (except for sampling differences), this does not necessarily 
mean that the intercorrelations are equal. The authors have no 
knowledge of any currently available method to test the equivalence of 
intercorrelations (phi-coefficients) among dichotomously scored items 
which may not be distributed normally in the population. 
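The dependence of the variance on the mean can be written out in one line. The notation X_i for the 0/1 score on item i is introduced here only for illustration; because X_i takes only the values zero and one, X_i squared equals X_i.

```latex
\operatorname{Var}(X_i)
  = E\!\left(X_i^{2}\right) - \left[E(X_i)\right]^{2}
  = p_i - p_i^{2}
  = p_i\,(1 - p_i)
```

Hence two items with equal proportions correct necessarily have equal variances, which is why a test of equal proportions also serves as a test of equal variances.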

Besides the problem of non-normally distributed variables, 
there is another problem in testing the equivalency of intercorrelations 
(phi-coefficients) that may not be immediately evident. Suppose we have 
three items (i.e., k = 3). In order to test whether or not the 
intercorrelations among the items are the same, we must take into 
account three different phi-coefficients: (a) r_12 between item one and 
item two, (b) r_13 between item one and item three, and (c) r_23 between 
item two and item three. Now, it is clear that (a) and (b) are correlated 
because both phi-coefficients are based on the same data for item one; 
(a) and (c) are also correlated since they are based on the data for 
item two; and finally, (b) and (c) are correlated because they are based 
on the same data for item three. Since the three r_ij's are clearly 
correlated, we cannot apply any of the well-known chi-square tests that 
are currently available for use with contingency tables. In the absence 
of a test of significance for examining the equivalence of 
intercorrelations (phi-coefficients) among k items, the evaluator will 
probably have to use his best judgment about whether or not the 
phi-coefficients are "approximately" equal. 
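Since the comparison of phi-coefficients must be made by inspection, it may help to tabulate all k(k - 1)/2 of them side by side. The sketch below is illustrative only: the function names are hypothetical, responses are assumed to be stored as one row per student with 0/1 item scores, and a coefficient is reported as `None` when an item has no variance (every student right or every student wrong), since phi is undefined in that case.

```python
import math
from itertools import combinations
from typing import Dict, Optional, Sequence, Tuple


def phi_coefficient(x: Sequence[int], y: Sequence[int]) -> Optional[float]:
    """Phi correlation between two items scored 0/1 for the same students."""
    n11 = sum(1 for a, b in zip(x, y) if a == 1 and b == 1)
    n10 = sum(1 for a, b in zip(x, y) if a == 1 and b == 0)
    n01 = sum(1 for a, b in zip(x, y) if a == 0 and b == 1)
    n00 = sum(1 for a, b in zip(x, y) if a == 0 and b == 0)
    denominator = math.sqrt(
        (n11 + n10) * (n01 + n00) * (n11 + n01) * (n10 + n00)
    )
    if denominator == 0:
        return None  # at least one item has zero variance
    return (n11 * n00 - n10 * n01) / denominator


def pairwise_phi(
    scores: Sequence[Sequence[int]],
) -> Dict[Tuple[int, int], Optional[float]]:
    """Return r_ij for every pair of items, keyed by 1-based item numbers."""
    n_items = len(scores[0])
    columns = [[row[j] for row in scores] for j in range(n_items)]
    return {
        (i + 1, j + 1): phi_coefficient(columns[i], columns[j])
        for i, j in combinations(range(n_items), 2)
    }


if __name__ == "__main__":
    # Hypothetical data: six students, three items.
    example = [
        [1, 1, 1],
        [1, 1, 0],
        [1, 0, 0],
        [0, 1, 0],
        [1, 1, 1],
        [1, 0, 0],
    ]
    for pair, r in pairwise_phi(example).items():
        print(pair, None if r is None else round(r, 3))
```

The listing gives the evaluator the raw coefficients only; it does not supply the significance test whose absence is noted above.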

In summary, we have defined corresponding items as items that 
(a) measure the same behavioral objective, (b) have the same means, (c) 
have the same variances, and (d) have the same intercorrelations. We 
have recommended Cochran's Q Test as a method for testing (b) and (c), 
but we are unable to specify a method for testing (d). In practice, 
however, the lack of a statistical test for (d) may not be too serious 
a limitation. Certainly, if conditions (a), (b), and (c) are fulfilled 
and the intercorrelations among the items are approximately equal, it 
is reasonable to assume that the items are "corresponding." 

Comments on Data for Decision Making 

For purposes of simplicity, the decision rules we have 
specified are based upon data from one pretest item, one terminal item, 
and one posttest item for each objective. There may, of course, be more 
than one pretest, terminal, and/or posttest item for any given objective. 
Such additional data can be taken into account in various ways. For 
example, one might merely combine the data from all the pretest 
(posttest or terminal) items relevant to a given objective in order to 
calculate the appropriate error rate. Alternatively, assuming, for 
example, that three posttest items test the same terminal objective, one 
might specify that if a student answers two of the three items 
correctly, then he has achieved the objective. Other alternatives are 
also possible; however, a multiplicity of pretest, posttest, or terminal 
items relevant to a given objective can complicate the interpretation of 
which item, if any, requires revision. 
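The two alternatives just described can be sketched as follows. This is an illustration under stated assumptions, not a prescribed procedure: the function names and the example responses are hypothetical, each row is assumed to hold one student's 0/1 scores on the items testing a single objective, and every student is assumed to have answered every item.

```python
from typing import Sequence


def pooled_error_rate(scores: Sequence[Sequence[int]]) -> float:
    """Error rate obtained by combining all responses to all items."""
    total_responses = sum(len(row) for row in scores)
    total_correct = sum(sum(row) for row in scores)
    return 1 - total_correct / total_responses


def objective_error_rate(
    scores: Sequence[Sequence[int]], required_correct: int = 2
) -> float:
    """Proportion of students who fail to reach `required_correct` items."""
    failures = sum(1 for row in scores if sum(row) < required_correct)
    return failures / len(scores)


if __name__ == "__main__":
    # Hypothetical data: four students, three posttest items, one objective.
    posttest = [
        [1, 1, 0],
        [1, 0, 0],
        [1, 1, 1],
        [0, 1, 1],
    ]
    print(round(pooled_error_rate(posttest), 3))      # pooled responses
    print(objective_error_rate(posttest, 2))          # two-of-three rule
```

For these illustrative data the pooled error rate is 4/12 and the two-of-three rule gives an error rate of 1/4, which shows that the two ways of combining items need not agree.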

We have also assumed that every student answers every item. 
There are several formulas available (see Guilford, 1954, pp. 418-424) 
that can be used to calculate error rates with missing data. Such 
formulas can be used instead of Equation 2. A large amount of missing 
data can, however, present serious problems, especially if the sample 
size is small. 

There are many discrimination indices available in the 
literature (see Guilford, 1954, pp. 424-440) that could be used to 
calculate BDI and PDI. In our opinion, however, the phi-coefficient and 
the B index (see Brennan, 1970, 1972 in press) are the best indices to 
use with criterion-referenced tests, since they make only weak 
distributional assumptions, and they allow the evaluator to specify 
virtually any cut-off between upper and lower groups. In addition, the 
index B has a very useful interpretation in terms of the number of 
discriminations made by an item. 

One further comment seems appropriate. Stolurow and Frincke 
(1966) have noted that there is a danger of rejecting good items (or 
good instruction) when the sample size is relatively small, say N = 15 
or 20. In their study, Stolurow and Frincke were concerned about error 
rates only. Since in this paper we examine both error rates and 
discrimination indices, it is certainly desirable that the sample size 
be sufficiently large. We believe that an N of about 25 or 30 should be 
adequate for most purposes. The technique we have proposed can be used 
with smaller sample sizes; however, the certainty with which decisions 
can be made is thereby reduced. 

Footnotes 



1 

Much of the research reported herein was performed pursuant 
to contracts with the United States Naval Academy, Contract No. 
N00161-70-C-0119, and the Office of Naval Research, Contract No. 
N00014-67-A-0298-0032. 

2 

A posttest item, as we are using the term, is, in part, a 
measure of retention. Clearly, the evaluator must temper his decisions 
about revision with knowledge about the length of time intervening 
between instruction and testing as well as the criticality of forgetting. 

3 

Items that have a virtual infinitude of possible answers 
have TER = 1.00; however, the evaluator should be careful not to assume 
that every free-response or open-ended test item has TER = 1.00. Very 
often such items are so worded that only two or three answers are 
really possible, in which case TER = 0.50 or TER = 0.67. 

4 

Terminal Error Rate would be a more descriptive phrase than 
Instructional Error Rate; however, we have chosen the latter to avoid 
the ambiguity involved in having TER stand for both Terminal Error Rate 
and Theoretical Error Rate. 

5 

These decisions should not, however, be interpreted too 
strictly; the evaluator will still have to use some degree of subjective 
judgment. For example, when we say, in subsequent discussions, that an 
item should be revised (R), we mean that our best guess on the basis of 
the data is that the item should be revised. This does not mean, 
however, that the item should be revised without a logical basis for 
revision. Also, when we say that an item (or instruction) is 
questionable (?), we mean that the data are not sufficient to make a 
definite judgment about whether or not the item (or instruction) should 
be revised. 

6 

It is also possible that the item has neither of these faults 
and the objective, while being easy for most of the students, is 
considered to be an integral part of the total set of objectives. In 
this case, the item would not be revised. A similar statement can be 
made for Rule 16, which will be discussed later. 

7 

It is also possible that the terminal item and posttest item 
are not measuring the same content at the same level of difficulty, even 
though this is an assumption underlying all the decision rules presented 
here. 

8 

The term - 1/2N in Equation 7 is a correction for discontinuity 
and, as such, can be dropped if the sample size is large. Note that when 
TER = 1.00, Z is undefined; in this case any value of DER > 0 can be 
considered significant. 

9 

One could make a case for using error rate on the posttest 
item (PER) rather than error rate on the terminal item; then, however, 
PER - BER would involve a confounding of gain with retention, as we are 
using the terms in this paper. 

10 

We are grateful to Mr. Eugene Millstein for developing the 
instructional program and collecting the data pursuant to a contract 
with the Office of Naval Research, ONR Contract No. N00014-67-A-0298-0003. 
Our analysis of the data should not, however, be interpreted as an 
evaluation of the program. 

11 

Recall that when derived error rates are available, Rules 16-19 
replace Rules 1-3 and 12-15, since the former rules are more exact 
statements of the latter rules. More specifically, Rule 18, in effect, 
replaces Rule 14. For objective number 11, application of Rule 18 
indicates no need for revision, which overrides the decision made on the 
basis of Rule 14. 

The reader will note that, in Table 4, if two or more rules 
indicate "no revision (NR)", we have identified only that rule which we 
believe is most important. There seems to be no particular advantage in 
identifying all the possible reasons for doing nothing! 

12 

This program is being used primarily as a vehicle for testing 
a psychological theory of sequencing instruction. As such, the program 
has been purposely written to discriminate among students who have 
experienced different instructional sequences; the program is not meant 
to teach micro-economics to all students in the most effective manner. 

TABLE 1 

Rules for Decision-Making 



Rule 



Error Rates 



Decision 



a 



No. TER BER BDI IER PER PDI Item Instruction Prerequisites 



2 

3 

4 

5 

6 

7 

8 
9 

10 

11 



H H 

L L 

L H 

K L 



L 

L 

L 

H 

H 

K 



0 

+ 



+ 

0 

+ 



NR 

NR 

NR 

R 

? 



NR 

? 

? 

R 

? 

? 



NR 

? 

? 

R 

R 

R 



E 

E 
TABLE 1 (cont'd) 
Rules for Decision-Making 

Rule   Error Rates                        Decision(a)
No.    TER  BER  BDI  IER  PER  PDI       Item   Instruction   Prerequisites
----------------------------------------------------------------------------
12                    L    L                     NR
13                    L    H                     R
14                    H    L                     ?
15                    H    H                     R
----------------------------------------------------------------------------
16     DER*(b)                            R
       DER(NS)(c)                         NR
17     RER > c1                                  R
       0 ≤ RER ≤ c1                              NR
18     RER < -c2                                 ?
       -c2 ≤ RER ≤ 0                             NR
19     PMPG < c3                                 R
       PMPG ≥ c3                                 NR
----------------------------------------------------------------------------

(a) "NR" means no revision required. 
    "R" means revision is required. 
    "?" means the data are not sufficient to make a sound judgment about 
    whether or not revision is required. 
    "E" means the prerequisites for the objective should be examined. 

(b) DER is significantly greater than zero at the .05 level for a 
    one-tailed test of significance. 

(c) DER is not significantly greater than zero at the .05 level for a 
    one-tailed test of significance. 
TABLE 2 



Error Rates and Discrimination Indices for a CAI Program in Micro-Economics 

           Raw Error Rates and Discrimination Indices            Derived Error Rates
Objective
Number     TER      BER      BDI      IER      PER      PDI      DER      RER      PMPG
------------------------------------------------------------------------------------------
 1         1.000    .750     .365     .250     .071     .205     .250*    -.179    .667
 2         .875     .964     .304     .143     .643     .880**   -.089    .500     .852
 3         .750     .786     .055     .107     .106     .205     -.036    -.001    .864
 4         .750     .714     .650**   .214     .036     .141     .036     -.178    .700
 5         .500     .604     .786**   .321     .000     .000     -.104    -.321    .469
 6         .667     .714     .475*    .107     .179     .015     -.047    .072     .850
 7         .750     .893     .548*    .321     .393     .510     -.143    .072     .641
 8         1.000    1.000    .000     .286     .607     .535     .000     .321     .714
 9         .500     .857     -.032    .071     .179     .309     -.357    .108     .917
10         .500     .500     .000     .000     .214     .309     .000     .214     1.000
11         1.000    .964     .304     .321     .143     .015     .036*    -.178    .667
12         1.000    .929     .132     .286     .429     .459     .071*    .143     .692
13         .500     .821     .737**   .214     .214     .357     -.321    .000     .739
14         1.000    .857     .420     .607     .464     .630*    .143*    -.143    .292
15         .500     .679     .580**   .143     .214     .357     -.179    .071     .789
16         1.000    1.000    .000     .143     .500     .535     .000     .357     .857
17         1.000    1.000    .000     .393     .964     .394     .000     .571     .607
18         .875     .964     .304     .679     .786     .397     -.089    .107     .296
------------------------------------------------------------------------------------------

 * p < .05 
** p < .01 
TABLE 3 

Categorical Error Rates and 
Discrimination Indices for a CAI Program in Micro-Economics 

           Raw Error Rates and Discrimination Indices(a)   Derived Error Rates
Objective
Number     TER   BER   BDI   IER   PER   PDI               DER(b)  RER(c)  PMPG(d)
----------------------------------------------------------------------------------
 1         H     H     0     L     L     0                 *       -       -
 2         H     H     0     L     H     +                 -       GT      -
 3         H     H     0     L     L     0                 -       -       -
 4         H     H     +     L     L     0                 -       -       -
 5         H     H     +     H     L     0                 -       LT      LT
 6         H     H     +     L     L     0                 -       -       -
 7         H     H     +     H     H     0                 -       -       -
 8         H     H     0     L     H     0                 -       GT      -
 9         H     H     0     L     L     0                 -       -       -
10         H     H     0     L     L     0                 -       GT      -
11         H     H     0     H     L     0                 *       -       -
12         H     H     0     L     H     0                 *       -       -
13         H     H     +     L     L     0                 -       -       -
14         H     H     0     H     H     +                 *       -       LT
15         H     H     +     L     L     0                 -       -       -
16         H     H     0     L     H     0                 -       GT      -
17         H     H     0     H     H     0                 -       GT      -
18         H     H     0     H     H     0                 -       -       LT
----------------------------------------------------------------------------------

(a) The cut-off value for TER, BER, IER, and PER is 0.30. 

(b) "-" indicates that DER is not significantly greater than zero at the 
    .05 level for a one-tailed test of significance. 

(c) "GT" indicates that RER is "greater than" 0.20. 
    "LT" indicates that RER is "less than" -0.30. 
    "-" indicates that -0.30 ≤ RER ≤ 0.20. 

(d) "LT" indicates that PMPG is "less than" 0.60. 
    "-" indicates that PMPG is greater than or equal to 0.60. 

* p < .05 
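The step from raw error rates to the categorical codes of Table 3 can be sketched as follows. This is an illustration under assumptions rather than a restatement of the equations given earlier in the paper: the expressions DER = TER - BER, RER = PER - IER, and PMPG = (BER - IER)/BER are inferred from, and reproduce, the values tabled in Table 2; the cut-offs are those stated in the footnotes to Table 3; the function and class names are hypothetical; and the treatment of a rate exactly equal to the 0.30 cut-off (coded "L" here) is an assumption.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass
class RawRates:
    ter: float  # theoretical error rate
    ber: float  # base error rate
    ier: float  # instructional error rate
    per: float  # posttest error rate


def derived_rates(r: RawRates) -> Tuple[float, float, Optional[float]]:
    """Return (DER, RER, PMPG) using the expressions assumed above."""
    der = r.ter - r.ber
    rer = r.per - r.ier
    # PMPG is undefined when no gain is possible (BER = 0).
    pmpg = (r.ber - r.ier) / r.ber if r.ber > 0 else None
    return der, rer, pmpg


def code_rate(rate: float, cutoff: float = 0.30) -> str:
    """Code a raw error rate as high (H) or low (L)."""
    return "H" if rate > cutoff else "L"


def code_rer(rer: float, upper: float = 0.20, lower: float = -0.30) -> str:
    """Code RER as GT, LT, or '-' (within the two cut-offs)."""
    if rer > upper:
        return "GT"
    if rer < lower:
        return "LT"
    return "-"


def code_pmpg(pmpg: float, cutoff: float = 0.60) -> str:
    """Code PMPG as LT when below the cut-off, otherwise '-'."""
    return "LT" if pmpg < cutoff else "-"


if __name__ == "__main__":
    # Objective 5 from Table 2.
    objective_5 = RawRates(ter=0.500, ber=0.604, ier=0.321, per=0.000)
    der, rer, pmpg = derived_rates(objective_5)
    print(round(der, 3), round(rer, 3), round(pmpg, 3))
    print(code_rate(objective_5.ier), code_rate(objective_5.per),
          code_rer(rer), code_pmpg(pmpg))
```

The sketch does not code DER, because its Table 3 entry depends on the significance test of Equation 7 rather than on a fixed cut-off.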
Revision Decisions by Objective for a CAI Program in Micro-Economics 




N 1(6) NR(6) NR(6) NR(12) NR(6) NR(6) 



